[TEST][NO-MERGE] Stress test domain sockets #382

Draft: wants to merge 10 commits into base: master

@bell-db bell-db commented Sep 23, 2024

We found that even when domain sockets are used to work around the port conflict issue, communication is still not fully reliable on macOS. With short messages, the test gets stuck about 6 times in 100k runs. With 100KB messages, it gets stuck far more often (about 2 times in 100 runs).

sbt "testOnly protocbridge.frontend.MacPluginFrontendSpec"
<timeout very frequently>

The same test (using nc -N) passes on Linux (Ubuntu 20.04.6 LTS, Xeon(R) Platinum 8375C).

Netcat client missing EOF

It turns out that the netcat bundled with macOS is old and buggy (it has no version number; man nc says 2001, while there is some speculation that it is from 2005). As a domain socket client it frequently gets stuck, especially with large messages (100KB), even though the server is a reliable socat echo server. This is evident with:

brew install socat
bash ./domain_socket_stress_test.sh 100000 socat-echo nc
...
Iterations completed: 194
Starting the server (PID: 78689)
The server has started and is listening to the socket (PID: 78689)
Started dumping random bytes to the socket (PID: 78694)
<stuck very frequently>

By contrast, nmap ncat and socat are reliable clients:

brew install nmap
bash ./domain_socket_stress_test.sh 100000 socat-echo ncat
bash ./domain_socket_stress_test.sh 100000 socat-echo socat
<works>

However, neither is bundled on macOS.
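
A stuck client never returns on its own, so a stress loop needs some kind of watchdog to record the hang and move on. The sketch below shows one way to do that in plain bash, without relying on an external timeout utility. `run_with_timeout` is a hypothetical helper introduced here for illustration; it is not taken from domain_socket_stress_test.sh, and the 124 exit status is only borrowed from the GNU timeout convention.

```bash
#!/usr/bin/env bash
# run_with_timeout SECONDS CMD [ARGS...]
# Hypothetical helper (an assumption, not part of the PR's script).
# Runs CMD in the background and polls it once per second. If CMD is still
# alive after SECONDS, it is killed and the function returns 124.
run_with_timeout() {
  local limit=$1
  shift
  "$@" &
  local pid=$!
  local waited=0
  # kill -0 sends no signal; it only checks whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      kill "$pid" 2>/dev/null
      wait "$pid" 2>/dev/null
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # The command finished by itself: propagate its own exit status.
  wait "$pid"
}
```

With a wrapper like this, a hung run shows up as exit status 124 instead of blocking the whole loop. Note that bash detaches stdin from background commands in scripts, so a client that reads its payload from a file would need the redirection inside the command itself (for example via `bash -c '...'`).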

Confusingly, this problem seems to disappear simply by setting a read timeout on the Scala server side:

val client = serverSocket.accept()
client.setSoTimeout(60 * 1000)  // Or any time period long enough.

Netcat client incomplete message

Another (potentially unrelated) issue is that a netcat client can deliver incomplete messages when run from a bash script if the server doesn't send anything back (instead of, e.g., echoing):

bash ./domain_socket_stress_test.sh 100000 nc-save nc
OR
bash ./domain_socket_stress_test.sh 100000 ncat-save nc
OR
bash ./domain_socket_stress_test.sh 100000 socat-save nc
OR
(nc -l -U "$SOCKET_PATH" > "$TEST_RESULT_PATH") &
SERVER_PID=$!  # PID of the background server
sleep 1  # Wait for the server to start
(nc -U "$SOCKET_PATH" < "$TEST_FILE_PATH") &
CLIENT_PID=$!  # PID of the background client
wait "$CLIENT_PID" 2>/dev/null
wait "$SERVER_PID" 2>/dev/null
...
Error: Expected 100000 bytes, but read    11264 bytes

(There is no obvious way to implement an echo server with macOS netcat)
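
The fixed `sleep 1` in the snippet above only guesses when the server is ready. One way to remove that guess is to poll for the socket file, as sketched below. `wait_for_socket` is a hypothetical helper, not something from the PR's script, and it is not claimed to fix the truncation. The socket file appears when the server binds, which can be slightly before it starts listening, so this narrows the startup race rather than eliminating it.

```bash
#!/usr/bin/env bash
# wait_for_socket PATH [MAX_TRIES]
# Hypothetical helper (an assumption, not part of the PR's script).
# Polls every 0.1 s until PATH exists and is a socket, giving up after
# MAX_TRIES attempts (default 50, i.e. roughly 5 seconds).
wait_for_socket() {
  local path=$1
  local tries=${2:-50}
  local i
  for ((i = 0; i < tries; i++)); do
    # -S is true only if the path exists and is a socket.
    [ -S "$path" ] && return 0
    sleep 0.1
  done
  return 1
}
```

In the snippet above, `wait_for_socket "$SOCKET_PATH" || exit 1` could stand in for the `sleep 1` line.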

The truncation does not reproduce when the same commands are run directly in the Terminal, or with nmap ncat / socat clients:

bash ./domain_socket_stress_test.sh 100000 nc-save ncat
OR
bash ./domain_socket_stress_test.sh 100000 nc-save socat
<works>

The root cause is unclear, but it might be related to the fact that the server doesn't send anything back, causing the client to terminate too early.
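
For reference, a byte-count check of the kind that produces the "Expected ... but read ..." error can be sketched as follows. `check_size` is a hypothetical helper and may differ from what domain_socket_stress_test.sh actually does. The padding before 11264 in the error output looks like the leading spaces that BSD/macOS `wc` prints, though that is a guess; the sketch strips them through arithmetic expansion.

```bash
#!/usr/bin/env bash
# check_size EXPECTED_BYTES FILE
# Hypothetical helper (an assumption, not part of the PR's script).
# Returns 0 if FILE is exactly EXPECTED_BYTES long, 1 otherwise.
check_size() {
  local expected=$1
  local file=$2
  local actual
  # Reading from stdin keeps the file name out of wc's output; the arithmetic
  # expansion drops any whitespace padding around the number.
  actual=$(( $(wc -c < "$file") ))
  if [ "$actual" -ne "$expected" ]; then
    echo "Error: Expected $expected bytes, but read $actual bytes" >&2
    return 1
  fi
}
```

A call such as `check_size 100000 "$TEST_RESULT_PATH"` after both processes have exited would flag a truncated transfer like the 11264-byte one above.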

Ncat / Socat

While nmap ncat and socat clients are reliable on their own, the stress test can still fail because a run gets stuck and times out. It's unclear whether the problem lies in the implementation here or in the junixsocket library.

<change to ncat or socat>
sbt "testOnly protocbridge.frontend.MacPluginFrontendSpec"
<timeout occasionally>

Confusingly, this problem also disappears once the server read timeout mentioned earlier is set:

val client = serverSocket.accept()
client.setSoTimeout(60 * 1000)  // Or any time period long enough.

bell-db force-pushed the bell-db/v0.9.7-domain-socket-stress-test branch from d0d7cd0 to 8bbb0a5 on September 23, 2024 23:53